{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Semantic Chunking with Chonkie and Model2Vec**\n",
    "\n",
    "Semantic chunking is a task of identifying the semantic boundaries of a piece of text. In this tutorial, we will use the [Chonkie](https://github.com/bhavnicksm/chonkie) library to perform semantic chunking on the book War and Peace. Chonkie is a library that provides a lightweight and fast solution to semantic chunking using pre-trained models. It supports our [potion models](https://huggingface.co/collections/minishlab/potion-6721e0abd4ea41881417f062) out of the box, which we will be using in this tutorial.\n",
    "\n",
    "After chunking our text, we will be using [Vicinity](https://github.com/MinishLab/vicinity), a lightweight nearest neighbors library, to create an index of our chunks and query them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Install the necessary libraries\n",
    "!pip install -q datasets model2vec numpy tqdm vicinity \"chonkie[semantic]\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import the necessary libraries\n",
    "import random \n",
    "import re\n",
    "import requests\n",
    "from time import perf_counter\n",
    "from chonkie import SDPMChunker\n",
    "from model2vec import StaticModel\n",
    "from vicinity import Vicinity\n",
    "\n",
    "random.seed(0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Loading and pre-processing**\n",
    "\n",
    "First, we will download War and Peace and apply some basic pre-processing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# URL for War and Peace on Project Gutenberg\n",
    "url = \"https://www.gutenberg.org/files/2600/2600-0.txt\"\n",
    "\n",
    "# Download the book\n",
    "response = requests.get(url)\n",
    "book_text = response.text\n",
    "\n",
    "def preprocess_text(text: str, min_length: int = 5):\n",
    "    \"\"\"Basic text preprocessing function.\"\"\"\n",
    "    text = text.replace(\"\\n\", \" \")\n",
    "    text = text.replace(\"\\r\", \" \")\n",
    "    sentences = re.findall(r'[^.!?]*[.!?]', text)\n",
    "    # Filter out sentences shorter than the specified minimum length\n",
    "    filtered_sentences = [sentence.strip() for sentence in sentences if len(sentence.split()) >= min_length]\n",
    "    # Recombine the filtered sentences\n",
    "    return ' '.join(filtered_sentences)\n",
    "\n",
    "# Preprocess the text\n",
    "book_text = preprocess_text(book_text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Chunking with Chonkie**\n",
    "\n",
    "Next, we will use Chonkie to chunk our text into semantic chunks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of chunks: 4436\n",
      "Time taken: 1.6311538339941762\n"
     ]
    }
   ],
   "source": [
    "# Initialize a SemanticChunker from Chonkie with the potion-base-8M model\n",
    "chunker = SDPMChunker(\n",
    "    embedding_model=\"minishlab/potion-base-32M\",\n",
    "    chunk_size = 512, \n",
    "    skip_window=5, \n",
    "    min_sentences=3\n",
    ")\n",
    "\n",
    "# Chunk the text\n",
    "time = perf_counter()\n",
    "chunks = chunker.chunk(book_text)\n",
    "print(f\"Number of chunks: {len(chunks)}\")\n",
    "print(f\"Time taken: {perf_counter() - time}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And that's it, we chunked the entirety of War and Peace in ~2 seconds. Not bad! Let's look at some example chunks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " Wait and we shall  see! As if fighting were fun. They are  like children from whom one can’t get any sensible account of what has  happened because they all want to show how well they can fight. But  that’s not what is needed now. “And what ingenious maneuvers they all propose to me! \n",
      "\n",
      " The first thing he saw on riding up to the space  where Túshin’s guns were stationed was an unharnessed horse with a  broken leg, that lay screaming piteously beside the harnessed horses. Blood was gushing from its leg as from a spring. Among the limbers lay  several dead men. \n",
      "\n",
      " Out of an army  of a hundred thousand we must expect at least twenty thousand wounded,  and we haven’t stretchers, or bunks, or dressers, or doctors enough for  six thousand. We have ten thousand carts, but we need other things as  well—we must manage as best we can! ”    The strange thought that of the thousands of men, young and old, who  had stared with merry surprise at his hat (perhaps the very men he had  noticed), twenty thousand were inevitably doomed to wounds and death  amazed Pierre. “They may die tomorrow; why are they thinking of anything but death? ”  And by some latent sequence of thought the descent of the Mozháysk hill,  the carts with the wounded, the ringing bells, the slanting rays of the  sun, and the songs of the cavalrymen vividly recurred to his mind. “The cavalry ride to battle and meet the wounded and do not for a moment  think of what awaits them, but pass by, winking at the wounded. Yet from  among these men twenty thousand are doomed to die, and they wonder at my  hat! ” thought Pierre, continuing his way to Tatárinova. \n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Print a few example chunks\n",
    "for _ in range(3):\n",
    "    chunk = random.choice(chunks)\n",
    "    print(chunk.text, \"\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Those look good. Next, let's create a vector search index with Vicinity and Model2Vec.\n",
    "\n",
    "**Creating a vector search index**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Time taken: 2.269912125004339\n"
     ]
    }
   ],
   "source": [
    "# Initialize an embedding model and encode the chunk texts\n",
    "time = perf_counter()\n",
    "model = StaticModel.from_pretrained(\"minishlab/potion-base-32M\")\n",
    "chunk_texts = [chunk.text for chunk in chunks]\n",
    "chunk_embeddings = model.encode(chunk_texts)\n",
    "\n",
    "# Create a Vicinity instance\n",
    "vicinity = Vicinity.from_vectors_and_items(vectors=chunk_embeddings, items=chunk_texts)\n",
    "print(f\"Time taken: {perf_counter() - time}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Done! We embedded all our chunks and created an in index in ~1.5 seconds. Now that we have our index, let's query it with some queries.\n",
    "\n",
    "**Querying the index**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Query: Emperor Napoleon\n",
      "--------------------------------------------------\n",
      " In 1808 the Emperor Alexander went to Erfurt for a fresh interview with  the Emperor Napoleon, and in the upper circles of Petersburg there was  much talk of the grandeur of this important meeting. CHAPTER XXII    In 1809 the intimacy between “the world’s two arbiters,” as  Napoleon and Alexander were called, was such that when Napoleon declared  war on Austria a Russian corps crossed the frontier to co-operate with  our old enemy Bonaparte against our old ally the Emperor of Austria, and  in court circles the possibility of marriage between Napoleon and one  of Alexander’s sisters was spoken of. But besides considerations of  foreign policy, the attention of Russian society was at that time keenly  directed on the internal changes that were being undertaken in all the  departments of government. Life meanwhile—real life, with its essential interests of health and  sickness, toil and rest, and its intellectual interests in thought,  science, poetry, music, love, friendship, hatred, and passions—went on  as usual, independently of and apart from political friendship or enmity  with Napoleon Bonaparte and from all the schemes of reconstruction. BOOK SIX: 1808 - 10            CHAPTER I    Prince Andrew had spent two years continuously in the country. All the plans Pierre had attempted on his estates—and constantly  changing from one thing to another had never accomplished—were carried  out by Prince Andrew without display and without perceptible difficulty. \n",
      "\n",
      " CHAPTER XXVI    On August 25, the eve of the battle of Borodinó, M. de Beausset, prefect  of the French Emperor’s palace, arrived at Napoleon’s quarters at  Valúevo with Colonel Fabvier, the former from Paris and the latter from  Madrid. Donning his court uniform, M. de Beausset ordered a box he had  brought for the Emperor to be carried before him and entered the first  compartment of Napoleon’s tent, where he began opening the box while  conversing with Napoleon’s aides-de-camp who surrounded him. Fabvier, not entering the tent, remained at the entrance talking to some  generals of his acquaintance. The Emperor Napoleon had not yet left his bedroom and was finishing his  toilet. \n",
      "\n",
      " In Russia there  was an Emperor, Alexander, who decided to restore order in Europe and  therefore fought against Napoleon. In 1807 he suddenly made friends  with him, but in 1811 they again quarreled and again began killing many  people. Napoleon led six hundred thousand men into Russia and captured  Moscow; then he suddenly ran away from Moscow, and the Emperor  Alexander, helped by the advice of Stein and others, united Europe to  arm against the disturber of its peace. All Napoleon’s allies suddenly  became his enemies and their forces advanced against the fresh forces he  raised. The Allies defeated Napoleon, entered Paris, forced Napoleon to  abdicate, and sent him to the island of Elba, not depriving him of the  title of Emperor and showing him every respect, though five years before  and one year later they all regarded him as an outlaw and a brigand. Then Louis XVIII, who till then had been the laughingstock both of the  French and the Allies, began to reign. And Napoleon, shedding tears  before his Old Guards, renounced the throne and went into exile. \n",
      "\n",
      "Query: The battle of Austerlitz\n",
      "--------------------------------------------------\n",
      " Behave as you did at  Austerlitz, Friedland, Vítebsk, and Smolénsk. Let our remotest posterity  recall your achievements this day with pride. Let it be said of each of  you: “He was in the great battle before Moscow! \n",
      "\n",
      " By a strange coincidence, this task, which turned out to be a most  difficult and important one, was entrusted to Dokhtúrov—that same modest  little Dokhtúrov whom no one had described to us as drawing up plans  of battles, dashing about in front of regiments, showering crosses on  batteries, and so on, and who was thought to be and was spoken of as  undecided and undiscerning—but whom we find commanding wherever the  position was most difficult all through the Russo-French wars from  Austerlitz to the year 1813. At Austerlitz he remained last at the  Augezd dam, rallying the regiments, saving what was possible when all  were flying and perishing and not a single general was left in the rear  guard. Ill with fever he went to Smolénsk with twenty thousand men  to defend the town against Napoleon’s whole army. \n",
      "\n",
      " “Nothing is truer or sadder. These gentlemen ride onto the bridge alone and wave white handkerchiefs;  they assure the officer on duty that they, the marshals, are on  their way to negotiate with Prince Auersperg. He lets them enter the  tête-de-pont. * They spin him a thousand gasconades, saying that  the war is over, that the Emperor Francis is arranging a meeting with  Bonaparte, that they desire to see Prince Auersperg, and so on. The  officer sends for Auersperg; these gentlemen embrace the officers, crack  jokes, sit on the cannon, and meanwhile a French battalion gets to  the bridge unobserved, flings the bags of incendiary material into  the water, and approaches the tête-de-pont. At length appears the  lieutenant general, our dear Prince Auersperg von Mautern himself. Flower of the Austrian army, hero of the Turkish wars! Hostilities are ended, we can shake one another’s hand. The  Emperor Napoleon burns with impatience to make Prince Auersperg’s  acquaintance. \n",
      "\n",
      "Query: Paris\n",
      "--------------------------------------------------\n",
      " Paris is Talma, la  Duchénois, Potier, the Sorbonne, the boulevards,” and noticing that  his conclusion was weaker than what had gone before, he added quickly:  “There is only one Paris in the world. You have been to Paris and have  remained Russian. Well, I don’t esteem you the less for it. \n",
      "\n",
      " Look at  our youths, look at our ladies! The French are our Gods: Paris is our  Kingdom of Heaven. ”    He began speaking louder, evidently to be heard by everyone. “French dresses, French ideas, French feelings! \n",
      "\n",
      " “Oh yes, one sees that plainly. A man who doesn’t know Paris  is a savage. You can tell a Parisian two leagues off. \n",
      "\n"
     ]
    }
   ],
   "source": [
    "queries = [\"Emperor Napoleon\", \"The battle of Austerlitz\", \"Paris\"]\n",
    "for query in queries:\n",
    "    print(f\"Query: {query}\\n{'-' * 50}\")\n",
    "    query_embedding = model.encode(query)\n",
    "    results = vicinity.query(query_embedding, k=3)[0]\n",
    "\n",
    "    for result in results:\n",
    "        print(result[0], \"\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These indeed look like relevant chunks, nice! That's it for this tutorial. We were able to chunk, index, and query War and Peace in about 3.5 seconds using Chonkie, Vicinity, and Model2Vec. Lightweight and fast, just how we like it."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.15"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
